Outline

  • QC and Droplet processing - 3 different approaches for this
  • Merge multiple data sets - 4 different approaches for this
  • Pseudotime analysis with slingshot
  • CITE-seq data processing
  • SMART-seq processing

A more advanced scRNAseq workflow

overview

A more advanced scRNAseq workflow

The full image of the workflow is available here: overview

scRNA-seq is a highly variable process, and each dataset presents unique QC problems. Though our simplified approach often works, we need many more tools in our toolbox to handle more complex datasets. It is often a case of trial and error to see which approaches work best for your data.

Cell Ranger


Cell Ranger

Cell Ranger is a suite of tools for single cell processing and analysis available from 10X Genomics.

In this session we will demonstrate the use of the Cell Ranger count tool to process our generated FastQ files and create the required outputs.

Cell Ranger Download

  • Cell Ranger is available from the 10x genomics website.

  • Also available are pre-built references for the Human and Mouse genomes (GRCh37/38 and GRCm38/mm10)


Cell Ranger

  • Cell Ranger only runs on linux machines (CentOS/RedHat 7.0+ and Ubuntu 14.04+).
  • Typically, due to memory requirements, users run Cell Ranger on a remote server and not their own machines.
  • To download Cell Ranger and the required reference to a remote server, we typically use the wget command. (This will all be run in the terminal on the server you are using.)

Download the software

wget -O cellranger-7.2.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.2.0.tar.gz?Expires=1701688001&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=Puwpsqsf~wMQz5e~PwTvM2DRQO1XdJ~9zeLCWqX6tVbOx~dnf24hP1bwlmNhybr3SZUQ8C12ywcICMH6Au02wxiCRm1uuTxZ0Uvq8g~s8L8s6XFyhepdi6Qjq8dzXNGoxswg3hModjKWVptTWq-MTHBDZv~yTFB7QAM9lzHHXo6SPWg8Fnx30ngmtGC5tDReVOiJ3DY0hsFvZtG3HaQ-HEbnzEH3lre-0rpWMBlsQu-vZ4RnKE0o3Xv6pQsb6261M19nHcpCsGhDCkFjDDbradx~SNw5rpY-HMxM4SnRuaOOI0rYyDNn7xdTat3eFj7rlgATXRaYx7SYNqDYKSrNWw__"

Download reference for Human genome (GRCh38)

wget -O refdata-gex-GRCh38-2020-A.tar.gz "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz"

Cell Ranger set-up

  • Having downloaded the software and references, we can then unpack them.

Unpack software and references.

tar -xzvf cellranger-7.2.0.tar.gz
tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz
  • Finally we can add the cellranger directory to our PATH.
export PATH=/PATH_TO_CELLRANGER_DIRECTORY/cellranger-7.2.0:$PATH

Running Cell Ranger count

Now that we have downloaded the Cell Ranger software and the required pre-built Human reference (GRCh38), we can start generating count data from scRNA-seq/snRNA-seq FastQ data.

Typically, the FastQ files for your scRNA-seq run will have been generated using the Cell Ranger mkfastq tool to produce a directory of FastQ files.

We can now use the cellranger count command with our reference and FastQ files to generate our count matrix and associated files.

If you are analyzing single nuclei RNA-seq data, remember to set the --include-introns flag.

cellranger count --id=my_run_name \
   --fastqs=PATH_TO_FASTQ_DIRECTORY \
   --transcriptome=/PATH_TO_CELLRANGER_DIRECTORY/refdata-gex-GRCh38-2020-A

Working with custom genomes

If you are working with a genome other than Human or Mouse, you will need to find another source for your Cell Ranger reference.

  • Luckily, many references have been pre-built by other consortia.
  • We can also build our own references using other tools in Cell Ranger.

Creating your own reference

To create your own references you will need two additional files.

  • FASTA file - File containing the full genome sequence for the reference of interest
  • GTF file - File containing the gene/transcript models for the reference of interest.

FASTA file (Genome sequence)

  • The reference genome stored as a collection of contigs/chromosomes.
  • A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
  • Typically comes in FASTA format.
    • The “>” line contains information on the contig
    • The lines following contain the contig sequence
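As a quick sketch of the format (using a tiny, made-up two-contig file; none of these names or sequences come from a real reference), each contig is one ">" header line followed by sequence lines, so counting ">" lines counts contigs:

```shell
# Write a tiny, hypothetical two-contig FASTA (illustration only)
cat > toy.fa <<'EOF'
>chr1 toy contig
ACGTNACGTACGT
ACGT
>chrM toy mitochondrial contig
GGCCTTAA
EOF

# Each contig starts with a ">" header, so counting headers counts contigs
grep -c '^>' toy.fa
```

Here grep reports 2, one per contig.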


GTF (Gene Transfer Format)


  • Used for genome annotation.

  • Stores position, feature (exon) and meta-feature (transcript/gene) information.

  • Importantly for Cell Ranger Count, only features labelled as exon (column 3) will be considered for counting signal in genes

  • Many genomes label mitochondrial genes as CDS rather than exon, so these must be updated.
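As a rough illustration of that last point (using a made-up three-line GTF, not a real annotation), relabelling CDS features as exon is a simple edit of column 3:

```shell
# Build a hypothetical, minimal GTF; fields are tab-separated and
# column 3 holds the feature type
printf 'chr1\tensembl\tgene\t1\t1000\t.\t+\t.\tgene_id "G1";\n'  > toy.gtf
printf 'chr1\tensembl\texon\t1\t500\t.\t+\t.\tgene_id "G1";\n'  >> toy.gtf
printf 'chrM\tensembl\tCDS\t1\t300\t.\t+\t.\tgene_id "MT1";\n'  >> toy.gtf

# Cell Ranger count only counts reads over "exon" features, so the
# mitochondrial CDS line would otherwise contribute no counts
awk 'BEGIN{FS=OFS="\t"} $3=="CDS"{$3="exon"} {print}' toy.gtf > toy.fixed.gtf

# Count the exon lines that are now countable
awk -F'\t' '$3=="exon"' toy.fixed.gtf | wc -l
```

After the fix, both the original exon line and the relabelled mitochondrial line count as exon features.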

Using Cell Ranger mkgtf

Now that we have the gene models in GTF format, we can use the Cell Ranger mkgtf tool to validate our GTF and remove any unwanted annotation types using the --attribute flag.

Below is an example of how 10x generated the GTF for the Human reference.

cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf \
Homo_sapiens.GRCh38.ensembl.filtered.gtf \
                   --attribute=gene_biotype:protein_coding \
                   --attribute=gene_biotype:lncRNA \
                   --attribute=gene_biotype:antisense \
                   --attribute=gene_biotype:IG_LV_gene \
                   --attribute=gene_biotype:IG_V_gene \
                   --attribute=gene_biotype:IG_V_pseudogene \
                   --attribute=gene_biotype:IG_D_gene \
                   --attribute=gene_biotype:IG_J_gene \
                   --attribute=gene_biotype:IG_J_pseudogene \
                   --attribute=gene_biotype:IG_C_gene \
                   --attribute=gene_biotype:IG_C_pseudogene \
                   --attribute=gene_biotype:TR_V_gene \
                   --attribute=gene_biotype:TR_V_pseudogene \
                   --attribute=gene_biotype:TR_D_gene \
                   --attribute=gene_biotype:TR_J_gene \
                   --attribute=gene_biotype:TR_J_pseudogene \
                   --attribute=gene_biotype:TR_C_gene

Using Cell Ranger mkref

Following filtering of your GTF to the required biotypes, we can use the Cell Ranger mkref tool to finally create our custom reference.

cellranger mkref --genome=custom_reference \
--fasta=custom_reference.fa  \
--genes=custom_reference_filtered.gtf

Cell Ranger - Output files


Cell Ranger - Outputs

Having completed the Cell Ranger count step, the user will have created a folder named as set by the --id flag of the count command.

Within this folder will be the outs/ directory containing all the outputs generated from Cell Ranger count.

Cell Ranger - Count Matrices

The count matrices to be used for further analysis are stored in both MEX and HDF5 formats within the output directories.

The filtered matrix only contains detected, cell-associated barcodes whereas the raw contains all barcodes (background and cell-associated).

MEX format:

  • filtered_feature_bc_matrix
  • raw_feature_bc_matrix

HDF5 format:

  • filtered_feature_bc_matrix.h5
  • raw_feature_bc_matrix.h5
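The MEX version is simply a directory of three gzipped files. A sketch of the expected layout (the file names are the standard Cell Ranger ones; the files here are empty placeholders):

```shell
# Recreate the layout of a Cell Ranger MEX output directory
mkdir -p filtered_feature_bc_matrix
touch filtered_feature_bc_matrix/barcodes.tsv.gz   # cell barcodes (matrix columns)
touch filtered_feature_bc_matrix/features.tsv.gz   # gene/feature IDs (matrix rows)
touch filtered_feature_bc_matrix/matrix.mtx.gz     # sparse count matrix
ls filtered_feature_bc_matrix
```

Readers such as Seurat's Read10X() expect exactly these three files in the directory.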

Cell Ranger - BAM files

The outs directory also contains a BAM file of alignments for all barcodes against the reference (possorted_genome_bam.bam) as well as an associated BAI index file (possorted_genome_bam.bam.bai).

This BAM file is sometimes used in downstream analysis, such as scSplit/Velocyto, as well as for the generation of signal graphs such as bigWigs.


Cell Ranger - Cloupe files

Cell Ranger also outputs files for visualisation within its own cloupe browser - cloupe.cloupe.

This allows for the visualisation of scRNA-seq/snRNA-seq data as a t-SNE/UMAP, with the ability to overlay QC metrics and gene expression onto the cells in real time.


Cell Ranger - Metrics and Web Summary

Cell Ranger will also output summaries of useful metrics as a text file (metrics_summary.csv) and as an intuitive web page.

Metrics include

  • Counts/UMIs per cell.
  • Number of cells detected.
  • Alignment quality.
  • Distribution of reads in genomic features.
  • Sequencing saturation
  • t-sne/UMAP with default clustering.

QC is essential

There are many potential issues which can arise in scRNA-seq/snRNA-seq data, including:

  • Empty droplets.
  • Low-quality cells (dead or dying).
  • Ambient RNA contamination.
  • Doublets.


Assessment of the overall quality of a scRNA-seq/snRNA-seq experiment after Cell Ranger can give our first insight into any issues we might face.

Cell Ranger - Web Summary QC


Web Summary overview

The web summary html file contains an interactive report describing the most essential QC for your single cell experiment as well as initial clustering and dimension reduction for your data.

The web summary also contains useful information on the input files and the software versions used in the analysis, for later reproducibility.

Sample panel

The first thing we can review is the Sample information panel.

  • Sample ID - Sample name (Assigned in cellranger count)
  • Chemistry - The 10x chemistry used
  • Include introns - Whether counting was run to include intron counts (typical for single nuclei RNA-seq).
  • Reference Path and Transcriptome - References used in analysis
  • Pipeline Version - Version of Cell Ranger used.


Sequencing panel

The Sequencing panel highlights information on the quality of the Illumina sequencing.

  • Number of reads - Total number of paired reads in library
  • Valid barcodes - Number of barcodes matching the whitelist of barcodes known to be in the kit (~ 1 million).
  • Valid UMIs - Total number of UMIs that are not all one base and contain no unknown bases.
  • Sequencing saturation - Unique valid barcode/UMI combinations versus all valid barcode/UMI combinations.
  • Q30 scores - Assessment of sequencing qualities for barcode/umi/index/RNA reads.
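The UMI validity rule above (not all one base, no unknown bases) can be sketched with a few made-up UMIs:

```shell
# Classify three hypothetical UMIs by the two validity rules
for umi in ACGTACGTACGT AAAAAAAAAAAA ACGTNCGTACGT; do
  if echo "$umi" | grep -q 'N'; then
    echo "$umi: invalid (contains an unknown base)"
  elif echo "$umi" | grep -q '^\(.\)\1*$'; then
    echo "$umi: invalid (all one base)"
  else
    echo "$umi: valid"
  fi
done
```

Only the first UMI passes both checks.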


Sequencing panel

Key Metrics:

* Q30 Bases in RNA read > 65% (usually > 80%)
    + Reflects sequencing quality
    + Check with your sequencing service supplier if low
* Sequencing saturation > 40% (usually ranges 20% ~ 80%)
    + Reflects the complexity of the library
    + Consider (though not necessarily) re-making the library if it is too low
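Sequencing saturation is roughly 1 minus the ratio of unique valid barcode/UMI/gene combinations to total usable reads. A toy calculation with made-up numbers:

```shell
# Hypothetical run: 80,000 usable reads collapse to 32,000 unique
# barcode/UMI/gene combinations, i.e. 60% of reads were duplicates
awk 'BEGIN { reads = 80000; unique = 32000; printf "saturation = %.2f\n", 1 - unique / reads }'
```

A saturation of 0.60 means further sequencing would mostly re-read molecules already seen.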

Mapping panel

The Mapping panel highlights information on the mapping of reads to the reference genome and transcriptome.

  • Reads mapped to genome - Total number of reads mapping to the genome
  • Reads mapped confidently to genome - Reads mapping uniquely to the genome
  • Reads mapped confidently to exonic/Intronic - Reads mapping uniquely to the exons or introns
  • Reads mapped confidently to transcriptome - Reads mapping to a unique gene (and consistent with splice junctions).


Mapping panel

Key Metrics:

* Mapped to Genome > 60% (usually ranges 50% ~ 90%)
    + Mapping rate to the reference genome
    + Check the reference genome version if too low
* Reads Mapped Confidently to Transcriptome > 30% (usually > 60%)
    + Reflects the quality of the transcriptome annotation
    + Check the annotation if too low

Cells panel

The Cells panel highlights some of the most important information in the report, the total number of cells captured and the distribution of counts across cells and genes.

  • Estimated number of cells - Total number of barcodes called as cell-associated.
  • Fraction reads in cells - Fraction of reads with a valid barcode that are associated with a cell and mapped to the transcriptome.
  • Median reads per cell - Median number of transcriptome reads within cell-associated barcodes.
  • Median genes per cell - Median number of genes detected (at least 1 count) per cell-associated barcode.


Cells panel

Key Metrics:

  • Fraction Reads in Cells > 70% (usually > 85%)
    • Reflects the level of ambient RNA contamination
    • Consider correcting for ambient RNA if < 90%
  • Median reads per cell > 20,000/cell and estimated number of cells 500 ~ 10,000
    • Values outside the normal range may be caused by a failure of cell identification
    • Check the knee plot and re-evaluate the cell number

Knee plot

The Cells panel also includes an interactive knee plot.

The knee plot shows:

  • On the x-axis, the barcodes ordered from the most frequent on the left to the least frequent on the right

  • On the y-axis, the frequency of each ordered barcode.

  • Highlighted in blue are the barcodes marked as associated to cells.


Knee plot

It is apparent that the barcodes labelled blue (cell-associated barcodes) are not defined by a single UMI-count cut-off.

In the latest versions of Cell Ranger, a two-step process is used to define cell-associated barcodes, based on the EmptyDrops method (Lun et al., 2019).

  • First, cells with high RNA content are identified based on a UMI cut-off.
  • Second, low-UMI barcodes are used as a background training set to identify additional cell-associated barcodes not called in the first step.

If required, the --force-cells flag can be used with cellranger count to call a set number of cell-associated barcodes.


Knee plot

The knee plot also acts as a good QC tool to investigate differing types of single cell failure.

Whereas our previous knee plot represented a good sample, differing knee plot patterns can be indicative of specific problems with the single cell protocol.

In this example we see no distinct cliff-and-knee, suggesting a failure in the integration of oil, beads and sample (a wetting failure) or a compromised sample.


Knee plot

If there is a clog in the machine, we may see a knee plot where the overall number of cell-associated barcodes is low.


Knee plot

There may be occasions where we see two sets of cliff-and-knees in our knee plot.

This could be indicative of a heterogeneous sample, where we have two populations of cells with differing overall RNA levels.

Knee plots should be interpreted in the context of the biology under investigation.


Analysis page

The web summary also contains an analysis page, where default dimension reduction, clustering and differential expression between clusters have been performed.

Additionally, the analysis page contains information on sequencing saturation and genes per cell vs. reads per cell.


t-sne and clustering

The t-sne plot shows the distribution and similarity within your data.

  • Review for high and low UMI cells driving t-sne structure and/or clustering.
  • Expected separation between and structure across clusters may be observed within the t-sne plot.
  • Identify expected clusters based on expression of marker genes


Sequence and Gene saturation

The sequencing saturation and median genes per cell plots show these metrics (as shown on the summary page) over successive downsamplings of the data.

By reviewing the curves of the downsampled metrics, we can assess whether we are approaching saturation for either metric.


Droplet processing


Droplet processing

overview

Droplet processing

Often empty droplets can still make their way past Cell Ranger, or we might see an issue in the Cell Ranger web report itself. If this happens we might want to do custom filtering outside of Cell Ranger.

One cause of these issues could be ambient RNAs. Ambient RNAs from lysed cells make empty droplets seem like they contain a cell or contaminate droplets that already have a cell.

Let's look at the PBMC 10k data from 10X Genomics as an example of how to deal with these issues. This time we will use the raw matrix, so we avoid the filtering that Cell Ranger does.

CellBender

CellBender is a command-line toolkit developed to eliminate technical artifacts, such as empty droplets and ambient RNAs, from scRNA-seq data (link).

In the current version, it contains the remove-background module, which is used to:

  • Detect empty droplets and ambient RNAs from the raw count matrix in the Cell Ranger output.
  • Remove any droplets considered to be empty, and correct the interference of ambient RNAs.

  • CellBender is based on a machine-learning strategy and is quite time-consuming. We can speed up the processing by using the CUDA version of CellBender. However, this means you will need access to a GPU.

CellBender command

We take the raw matrices from Cell Ranger and run CellBender. The --cuda argument will speed up the function, but requires you to run it on a GPU.

input_h5=the_raw_matrix_in_h5_format_from_cellranger #essential
output_h5=assign_the_h5_file_path_for_the_cellbender_corrected_matrix # essential
fpr=threshold_of_FALSE_POSITIVE_RATE # default 0.01
epochs=number_of_epochs_to_train # default 150
num_train=number_of_times_to_attempt_to_train_the_model # default 1. would speed up while setting greater

cellbender remove-background --input $input_h5 --output $output_h5 --fpr $fpr --epochs $epochs --num-training-tries $num_train --cuda

CellBender and Seurat

The h5 file produced by CellBender is currently incompatible with Seurat, so it must first be converted before loading.

CellBender has a guide on how to convert the h5 file for use with Seurat here, using PyTables. This is another command line tool.

ptrepack --complevel 5 cellbender_filtered.h5:/matrix cellbender_filtered_forseurat.h5:/matrix

Cell Ranger vs. CellBender

We have some processed results here from CellBender. We can compare this to our filtered matrix from Cell Ranger.

library(Seurat)
cbFilt_mtx <- Read10X_h5("data/cbFilt_PBMCv3_20230324_filtered.h5")

Cell Ranger vs. CellBender

We also need to read back in our filtered PBMC data from Cell Ranger.

download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
    "10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")

Cell Ranger vs. CellBender

dim(cbFilt_mtx)
## [1] 36601 12244
dim(filt_mtx)
## [1] 36601 11485

Create some functions

We are going to try out a few tools and approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.

Normalization:

  • Log normalization with scale factor = 10,000
  • Find variable features with vst, selecting the top 2000 variable features

data_proc <- function(seu) {
    seu <- NormalizeData(seu, normalization.method = "LogNormalize", scale.factor = 10000)
    seu <- FindVariableFeatures(seu, selection.method = "vst", nfeatures = 2000)
    return(seu)
}

Create some functions

Make clusters:

  • Scale data with ScaleData()
  • Principal Component Analysis using RunPCA() with npcs=30 PCs
  • Non-linear dimension reduction with UMAP using RunUMAP() with dims=1:10
  • Estimate neighbors using FindNeighbors() with dims=1:10
  • Identify clusters with FindClusters() using resolution=0.5

quick_clust <- function(seu) {
    set.seed(42)
    seu <- ScaleData(seu, verbose = FALSE)
    seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
    seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
    seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
    seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
    return(seu)
}

Data processing

message("processing matrix from CellRanger")
seu <- CreateSeuratObject(filt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_filt <- seu

message("processing matrix from CellBender")
seu <- CreateSeuratObject(cbFilt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_cbFilt <- seu

Evaluate with marker gene expression

DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()

Evaluate with marker gene expression

DimPlot(seu_cbFilt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker gene expression

mark_gene <- c("CD3E", "CCR7")
VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Evaluate with marker gene expression

mark_gene <- c("CD3E", "CCR7")
VlnPlot(seu_cbFilt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)


Droplet processing

overview

Droplet evaluation in R


Detect TRUE cells

  • The knee plot is available in the Web Summary from Cell Ranger.
  • CellBender provides methods to automate detection, filtration and correction.
  • We can also draw the knee plot by using DropletUtils in R.

Load dataset

We need to load in the dataset, but it is important not to set any filters yet. This means we will be using the raw matrix.

library(DropletUtils)

download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz",
    "10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")

raw_mtx <- Seurat::Read10X("raw_feature_bc_matrix")
dim(raw_mtx)
## [1]   36601 2099284

Load dataset

Calculate the counts for each cell barcode and their rank.

bcrank <- barcodeRanks(raw_mtx)
head(bcrank, 2)
## DataFrame with 2 rows and 3 columns
##                         rank     total    fitted
##                    <numeric> <integer> <numeric>
## AAACCCAAGAAACCAT-1   1752547         0        NA
## AAACCCAAGAAACCCA-1    914872         1        NA

Draw Knee plot

To draw a custom knee plot we need to remove duplicated ranks and grab the knee and inflection points.

uniq <- !duplicated(bcrank$rank)
bcrank <- bcrank[uniq, ]

knee <- metadata(bcrank)$knee
message(paste0("knee point: ", knee))
inflection <- metadata(bcrank)$inflection
message(paste0("inflection point: ", inflection))

Draw Knee plot

We can now draw our knee plot using ggplot().

library(ggplot2)

ggplot(as.data.frame(bcrank), aes(x = rank, y = total)) + geom_point() + geom_hline(yintercept = knee,
    linetype = "dashed", color = "blue") + geom_hline(yintercept = inflection, linetype = "dashed",
    color = "green") + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10") +
    labs(x = "cell barcode ranked by counts", y = "UMI counts of each cell barcode") +
    theme_classic()

Draw Knee plot

We can now draw our knee plot using ggplot(). The knee is labelled in blue. The inflection is labelled in green.

Detect empty droplets

We can detect empty droplets with DropletUtils::emptyDrops(). The p-value for each cell is computed by Monte Carlo simulation, based on the deviation of a given cell from the ambient RNA pool.

e.out <- emptyDrops(raw_mtx)
e.out <- e.out[order(e.out$FDR), ]
head(e.out)
## DataFrame with 6 rows and 5 columns
##                        Total   LogProb    PValue   Limited       FDR
##                    <integer> <numeric> <numeric> <logical> <numeric>
## AAACCCAAGGTAGTCG-1      6088  -7854.52 9.999e-05      TRUE         0
## AAACCCACAATCCAGT-1     14628 -17829.20 9.999e-05      TRUE         0
## AAACCCACACCGTCTT-1      7642  -9887.60 9.999e-05      TRUE         0
## AAACCCACATCTCATT-1      9834 -13742.83 9.999e-05      TRUE         0
## AAACCCACATTGACTG-1      7409 -10416.66 9.999e-05      TRUE         0
## AAACCCAGTCACTGAT-1      5834  -8026.81 9.999e-05      TRUE         0

Detect empty droplets

The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.

ggplot(as.data.frame(e.out), aes(x = Total)) + geom_histogram() + geom_vline(xintercept = knee,
    color = "red", linetype = "dashed") + labs(x = "UMI counts per cell", y = "Frequency") +
    scale_y_continuous(trans = "log10") + scale_x_continuous(trans = "log10") + theme_classic()

Detect empty droplets

The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.

Filter empty droplets

In the DropletUtils manual, the FDR threshold is set to 0.001 in most cases. We can filter based on this cutoff.

table(e.out$FDR < 0.001)
## 
## FALSE  TRUE 
##   605 11580
cell_sel <- rownames(e.out)[e.out$FDR < 0.001]
filt_mtx <- raw_mtx[, colnames(raw_mtx) %in% cell_sel]

Ambient RNA evaluation in R


Ambient RNAs

  • Cell-free RNA molecules spread randomly in the solution. They can be enclosed in droplets during library preparation, falsely increasing the UMI counts of genes.

  • Ambient RNAs present two signatures:

    • The majority are found in the empty-droplet population.
    • No specificity: this contamination can make a single cell look like it expresses two (or more) mutually exclusive marker genes.
  • Based on these two properties, we can estimate the ambient RNA contamination rate and correct for it. Here we will demonstrate how to correct ambient RNA contamination with DropletUtils and SoupX, respectively.

Ambient RNAs

overview

Ambient RNAs

We can see here an example of very high ambient RNA contamination. This marker gene is supposed to be very specific. The UMAP also does not look great.

Ambient RNAs

The same dataset has very high MT contamination. This is a good indicator of ambient RNA contamination.

Correct ambient RNAs

Let's start by estimating our ambient RNA contamination. This uses DropletUtils.

amb <- metadata(e.out)$ambient[, 1]
head(amb)
##   AL627309.1   AL627309.3   AL627309.5   AL627309.4   AL669831.2    LINC01409 
## 8.369426e-07 9.859354e-08 5.635719e-06 9.859354e-08 9.859354e-08 6.511074e-06

Correct ambient RNAs

We can read back in our filtered PBMC data.

download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
    "10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")

Correct ambient RNAs

Then filter out empty droplets.

filt_mtx_drop <- filt_mtx[rownames(filt_mtx) %in% names(amb), ]
seu <- CreateSeuratObject(filt_mtx_drop)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
sce <- as.SingleCellExperiment(seu, assay = "RNA")

Correct ambient RNAs

library(scater)
plotUMAP(sce, colour_by = "seurat_clusters")

Correct ambient RNAs

Let's put our uncorrected, filtered dataset in a list.

seu_list <- list()
seu_list[["woCorr"]] <- seu

Correct ambient RNAs

We can use the removeAmbience() function to remove ambient RNAs.

out <- removeAmbience(counts(sce), ambient = amb, groups = sce$seurat_clusters)
rownames(out) <- rownames(sce)
colnames(out) <- colnames(sce)

Correct ambient RNAs

Now that we have run a correction, we want to reprocess and cluster our data.

seu_list[["withCorr"]] <- CreateSeuratObject(out)
seu_list[["withCorr"]] <- data_proc(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- ScaleData(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- quick_clust(seu_list[["withCorr"]])

Evaluate with marker genes

First, let's look at the UMAP without ambient RNA correction.

DimPlot(seu_list$woCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker genes

Let's now compare to the UMAP with ambient RNA correction.

DimPlot(seu_list$withCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

Evaluate with marker genes

Let's compare some of our marker genes.

mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14")
mark_gene
## [1] "CCR7"  "CD8A"  "MS4A1" "CD14"

Evaluate with marker genes

Let's now look at a violin plot of the uncorrected data.

VlnPlot(seu_list$woCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Evaluate with marker genes

And then again in the corrected.

VlnPlot(seu_list$withCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Ambient RNA and SoupX

SoupX is an alternative tool for droplet processing, developed for ambient RNA detection and correction (link). It includes the following steps:

  • Estimate ambient RNA expression in the empty droplets.
  • Estimate the contamination rate using the ambient RNA expression levels across clusters.
  • Correct the ambient RNA contamination.

Prepare essential files

We need our filtered and unfiltered matrices, plus our processed, filtered Seurat object.

clust <- setNames(seu$seurat_clusters, Cells(seu))
seu_filt <- seu

Import Matrices

Import our matrices into the SoupX channel object.

library(SoupX)
soupOBJ <- SoupChannel(raw_mtx, filt_mtx)
soupOBJ <- setClusters(soupOBJ, clust)

Automatic estimates

The autoEstCont() function can be used to estimate contamination. We can then export the corrected counts with adjustCounts().

soupOBJ <- autoEstCont(soupOBJ)

autoCor_mtx <- adjustCounts(soupOBJ)

Build a Seurat object

We will now put our corrected matrix into a Seurat object.

seu <- CreateSeuratObject(autoCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_autoCorr <- seu

Estimate contamination

We can estimate contamination using known marker genes.

mark_list <- list(`CD4 T-cell` = c("IL7R", "CCR7", "S100A4"), `CD8 T-cell` = c("CD8A"),
    `B-Cell` = c("MS4A1"), Monocyte = c("CD14", "FCGR3A"), DC = c("FCER1A"), NK = c("NKG7",
        "GNLY"), Platelet = c("PPBP"))

use_toEst <- estimateNonExpressingCells(soupOBJ, nonExpressedGeneList = mark_list)
soupOBJ <- calculateContaminationFraction(soupOBJ, mark_list, useToEst = use_toEst)
rho <- unique(soupOBJ$metaData$rho)
rho
## [1] 0.0193281

Export matrix after correction

soupOBJ <- setContaminationFraction(soupOBJ, rho, forceAccept = TRUE)
estCor_mtx <- adjustCounts(soupOBJ)

Add exported matrix to Seurat

seu <- CreateSeuratObject(estCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_estCorr <- seu

Evaluate marker genes

mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
    "PPBP")
mark_gene
## [1] "CCR7"   "CD8A"   "MS4A1"  "CD14"   "FCGR3A" "FCER1A" "GNLY"   "NKG7"  
## [9] "PPBP"

Without Correction

DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()

Without Correction

VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

With Correction (auto)

DimPlot(seu_autoCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

With Correction (auto)

VlnPlot(seu_autoCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

With Correction (est)

DimPlot(seu_estCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
    NoLegend()

With Correction (est)

VlnPlot(seu_estCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)

Normalization and transformation


SCTransform

overview

SCTransform

In our simplified workflow we used a log normalization approach. Recently, an alternative approach has been gaining popularity: SCTransform. Let's compare it back to our log normalization.

SCTransform

  • SCTransform is a normalization based on negative binomial regression.
  • It is performed with SCTransform() function of Seurat.
  • This function is equivalent to the combination of NormalizeData(), FindVariableFeatures(), and ScaleData().
seu_obj <- SCTransform(seu_obj, variable.features.n = 3000)
seu_obj
## An object of class Seurat 
## 36778 features across 11428 samples within 2 assays 
## Active assay: SCT (18389 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA

SCTransform

The normalized data is stored in assay SCT. We can see this is now the active assay. The log results are stored in the RNA assay. We can control which is the active assay with the DefaultAssay() function.

DefaultAssay(seu_obj) <- "RNA"

DefaultAssay(seu_obj) <- "SCT"

SCTransform vs log normalization

We can extract out the results of our normalized and scaled data with the GetAssayData() function.

log_mat <- GetAssayData(seu_obj, assay = "RNA", slot = "data")
log_mat <- as.matrix(log_mat)
log_avgExp <- rowMeans(log_mat)

sct_mat <- GetAssayData(seu_obj, assay = "SCT", slot = "data")
sct_mat <- as.matrix(sct_mat)
sct_avgExp <- rowMeans(sct_mat)

log normalization vs SCTransform

We can test the concordance with Spearman’s correlation.

dat <- data.frame(logNorm = log_avgExp, SCT = sct_avgExp)
cor_val <- cor.test(log_avgExp, sct_avgExp, method = "spearman")

ggplot(dat, aes(x = logNorm, y = SCT)) + geom_point() + geom_smooth() + labs(x = "Log_Normalization",
    y = "SCTransform", subtitle = paste0("rho=", round(cor_val$estimate, 3), "; p-value=",
        cor_val$p.value[1])) + theme_classic()

log normalization vs SCTransform

We can see there is very high similarity between the results.
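Beyond average expression, we can also check how the variable-feature sets chosen under the two normalizations overlap. This is a quick sketch that assumes the seu_obj processed above, and that FindVariableFeatures() was also run on the RNA assay at some earlier step:

```r
# Compare variable features called under each normalization (sketch)
sct_hvg <- VariableFeatures(seu_obj, assay = "SCT")
rna_hvg <- VariableFeatures(seu_obj, assay = "RNA")

# Fraction of SCT variable features also called variable under log normalization
length(intersect(sct_hvg, rna_hvg)) / length(sct_hvg)
```

A high overlap here is another indication that the two normalizations agree on which genes carry the interesting variation.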

Merging Datasets


An advanced scRNAseq workflow

overview

Merge multiple datasets

We will discuss a number of approaches and methods we may need to use to merge datasets.

  • No corrections
  • Reciprocal Principal Component Analysis (RPCA)
  • Mutual Nearest Neighbors (MNN)
  • Harmony

Example dataset

We will be using the IFNB-Stimulated and Control PBMCs from Seurat Data. The easiest way to get this data is from their custom package SeuratData which is hosted on GitHub.

We will quickly show you how to get this data, but it isn’t necessary to run these steps.

remotes::install_github("satijalab/seurat-data")
library(Seurat)
library(SeuratData)
InstallData("ifnb")
LoadData("ifnb")
head(ifnb, 2)
##                   orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL          CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL          CD14 Mono

Example dataset

This dataset is already loaded as a Seurat object, and it is already merged, so we need to split it in order to merge it ourselves. The groups are in the “stim” column of the metadata.

table(ifnb$stim)
ifnb_list <- Seurat::SplitObject(ifnb, split.by = "stim")
ifnb_list

At this point you could save your data object for later using the save() or saveRDS() function.

save("ifnb_list", file = "data/seuOBJ_IFNB_splitByStim.RData")

Load in your dataset

We have the ifnb_list saved in an RData object, which you can load in.

load("data/seuOBJ_IFNB_splitByStim.RData")

Simple Merge


Concatenate the datasets

We can use the merge() function to concatenate our two data sets as they are. We need to provide cell ID prefixes for each group and a name for the project as arguments.

ifnb_merge <- merge(ifnb_list$CTRL, ifnb_list$STIM, add.cell.ids = c("CTRL", "STIM"),
    project = "ifnb_seuMerge")
head(ifnb_merge, 2)
##                        orig.ident nCount_RNA nFeature_RNA stim
## CTRL_AAACATACATTTCC.1 IMMUNE_CTRL       3017          877 CTRL
## CTRL_AAACATACCAGAAA.1 IMMUNE_CTRL       2481          713 CTRL
##                       seurat_annotations
## CTRL_AAACATACATTTCC.1          CD14 Mono
## CTRL_AAACATACCAGAAA.1          CD14 Mono

Process and make clusters

We can use our processing and clustering functions to analyse our merged dataset.

ifnb_merge <- data_proc(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)
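The data_proc() and quick_clust() helpers come from the earlier, simplified workflow and are not defined in this section. A minimal sketch of what such helpers might contain is below; the parameter choices are illustrative, not the course's exact definitions:

```r
library(Seurat)

# Hypothetical sketch of the helper functions used throughout this section
data_proc <- function(seu) {
    seu <- NormalizeData(seu, verbose = FALSE)
    seu <- FindVariableFeatures(seu, nfeatures = 2000, verbose = FALSE)
    seu
}

quick_clust <- function(seu, dims = 1:10) {
    seu <- ScaleData(seu, verbose = FALSE)
    seu <- RunPCA(seu, verbose = FALSE)
    seu <- FindNeighbors(seu, dims = dims, verbose = FALSE)
    seu <- FindClusters(seu, verbose = FALSE)
    RunUMAP(seu, dims = dims, verbose = FALSE)
}
```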

UMAP

Our UMAP shows that the cells separate almost completely by condition.

DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)

Evaluate with cell types

Cells of a given type should usually cluster together. Here we see the opposite pattern: each cell type is split into distinct groups depending on the sample.

DimPlot(ifnb_merge, group.by = "seurat_annotations", pt.size = 0.2, split.by = "stim")

STIM/CTRL don’t group

  • This result indicates that the difference between the STIM and CTRL groups is large.
  • Is this a true biological phenomenon or is it a batch effect?

Merge with reciprocal PCA


Merge with reciprocal PCA

Reciprocal PCA minimizes the batch effects while merging different data sets.

This workflow is modified from canonical correlation analysis (CCA), which is widely used to merge batches.

RPCA workflow

  1. Normalize data sets using our data_proc() function.
  2. Select features for integration using SelectIntegrationFeatures().
  3. Scale data and run PCA using the features identified for integration.
  4. Identify anchors for integration using FindIntegrationAnchors().
  5. Integrate data using IntegrateData().
  6. Run quick_clust() and evaluate results with UMAP.

Prepare for RPCA merge

First, we prepare the data for integration. We will normalize the data sets separately. Then we need to identify features for integration; this is similar to the FindVariableFeatures() step we ran on a single dataset. Lastly, we run scaling and PCA using these features.

ifnb_list <- lapply(ifnb_list, data_proc)

feats <- SelectIntegrationFeatures(ifnb_list)

ifnb_list <- lapply(ifnb_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Integrating data in RPCA merge

We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.

anchors <- FindIntegrationAnchors(ifnb_list, anchor.features = feats, reduction = "rpca")

ifnb_merge <- IntegrateData(anchorset = anchors)

ifnb_merge
## An object of class Seurat 
## 16053 features across 13999 samples within 2 assays 
## Active assay: integrated (2000 features, 2000 variable features)
##  1 layer present: data
##  1 other assay present: RNA

Evaluating RPCA using clusters

To evaluate how well the merge has worked we must check the clustering. Again we must scale, and then use our quick_clust function.

We can now see that our two data sets overlay with each other.

ifnb_merge <- ScaleData(ifnb_merge)

ifnb_merge <- quick_clust(ifnb_merge)

DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)

Evaluating RPCA using clusters

We can now see that our two data sets overlay with each other.

Evaluating RPCA using clusters

We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now.

tbl <- table(ifnb_merge$stim, ifnb_merge$seurat_clusters)

barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")
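Raw counts can be misleading when the groups differ in size, so it can also help to normalize each cluster to proportions. A small base-R addition building on the tbl object above:

```r
# Convert counts to within-cluster proportions (each column sums to 1)
prop_tbl <- prop.table(tbl, margin = 2)
barplot(prop_tbl, main = "Group proportions within each cluster")
```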

Evaluating RPCA using cell types

We can also check the cell types. Using UMAPs we can split and compare across our conditions and cell types.

DimPlot(ifnb_merge, group.by = "seurat_annotations", split.by = "stim", pt.size = 0.2)

Evaluating RPCA using cell types

Using heatmaps we can also check how specific each cluster is to each cell type.

library(pheatmap)

tbl <- table(ifnb_merge$seurat_clusters, ifnb_merge$seurat_annotations)
pheatmap(tbl, scale = "column")

Merge data with MNN correction


Merge data with MNN

The Mutual Nearest Neighbors (MNN) approach uses pairs of mutually nearest cells across batches, rather than anchor features, to estimate the difference between batches.

The MNN correction was published by Haghverdi et al. (from the Marioni lab), Nature Biotechnology (2018). link

Steps for merging data with MNN

  1. Convert into SingleCellExperiment
  2. Identify features across samples
  3. Normalization
  4. Identify variables
  5. Merge data with MNN correction
  6. Build UMAP and clusters
  7. Evaluate with UMAP
  8. Evaluate composition of samples in each cluster

Preparing to merge data with MNN

First we have to convert each Seurat object to a SingleCellExperiment object.

sce_list <- lapply(ifnb_list, function(seu) {
    sce <- as.SingleCellExperiment(seu, assay = "RNA")
    rowData(sce)$SYMBOL <- rownames(sce)
    return(sce)
})


sce_list
## $CTRL
## class: SingleCellExperiment 
## dim: 14053 6548 
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(6548): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTTCGC.1
##   TTTGCATGGTCCTC.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
## 
## $STIM
## class: SingleCellExperiment 
## dim: 14053 7451 
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(7451): AAACATACCAAGCT.1 AAACATACCCCTAC.1 ... TTTGCATGCTAAGC.1
##   TTTGCATGGGACGA.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):

Preparing to merge data with MNN

As with RPCA we need to model and identify highly variable genes. For this we will use the scran functions modelGeneVar() and getTopHVGs(). We simply provide our SingleCellExperiment object.

library(scran)
dec_list <- lapply(sce_list, modelGeneVar)

hvgc_list <- lapply(sce_list, getTopHVGs, prop = 0.1)

Preparing to merge data with MNN

Next we will find the features that are shared between samples. We must first define the shared “universe” of genes between our samples and subset our variable genes to these. We can then combine the variance estimates to find features that are variable in both data sets.

universe <- intersect(rownames(sce_list$CTRL), rownames(sce_list$STIM))
sce_list <- lapply(sce_list, function(sce, universe) {
    sce <- sce[universe, ]
    return(sce)
}, universe)
dec_list <- lapply(dec_list, function(dec, universe) {
    dec <- dec[universe, ]
    return(dec)
}, universe)

combined_dec <- combineVar(dec_list$CTRL, dec_list$STIM)
chosen_hvgs <- combined_dec$bio > 0

Merge data with MNN

We will use the batchelor package to run MNN with the fastMNN() function.

  • d: number of principal components to compute
  • k: number of nearest neighbors to consider
  • subset.row: subset of genes to use; here, the top variable features

library(batchelor)
mnn_res <- fastMNN(CTRL = sce_list$CTRL, STIM = sce_list$STIM, d = 50, k = 20, subset.row = chosen_hvgs)
mnn_res
## class: SingleCellExperiment 
## dim: 1958 13999 
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(1958): HES4 ISG15 ... CTD-2521M24.5 AJ006998.2
## rowData names(1): rotation
## colnames(13999): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTAAGC.1
##   TTTGCATGGGACGA.1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):

Merge data with MNN

The resulting SingleCellExperiment object contains the batch information. We can also retrieve a matrix of our dimension reduction results and corrected log counts with the reducedDim() and assay() functions.

table(mnn_res$batch)
## 
## CTRL STIM 
## 6548 7451
reducedDim(mnn_res, "corrected")[1:2, ]
##                       [,1]         [,2]        [,3]        [,4]        [,5]
## AAACATACATTTCC.1 0.4692084  0.002558369  0.24112460 -0.01932947 -0.02919753
## AAACATACCAGAAA.1 0.4849751 -0.221433871 -0.03194567 -0.04457301  0.02085704
##                           [,6]        [,7]        [,8]         [,9]       [,10]
## AAACATACATTTCC.1 -0.0001116517 -0.05219879 -0.01560541 -0.018456590  0.02042355
## AAACATACCAGAAA.1  0.0208653688  0.02518230 -0.05640069  0.001590786 -0.02042924
##                        [,11]       [,12]        [,13]       [,14]         [,15]
## AAACATACATTTCC.1 0.008021788 -0.04396844  0.005131837 -0.01141104 -0.0002610285
## AAACATACCAGAAA.1 0.006484710  0.02690261 -0.014495237  0.01603656  0.0417677341
##                         [,16]       [,17]        [,18]        [,19]       [,20]
## AAACATACATTTCC.1 -0.001044311 -0.02583133 -0.004660621  0.003453847 0.007833429
## AAACATACCAGAAA.1  0.002738321  0.01539083  0.028006124 -0.010241152 0.016709705
##                         [,21]       [,22]       [,23]        [,24]        [,25]
## AAACATACATTTCC.1 -0.004925765 -0.01688819 -0.01693241  0.003615093 -0.007882780
## AAACATACCAGAAA.1  0.019963264  0.02561860  0.01056931 -0.010869554 -0.005000097
##                        [,26]        [,27]        [,28]       [,29]        [,30]
## AAACATACATTTCC.1  0.01073247 0.0075159768  0.006422343 0.004441872 -0.002957031
## AAACATACCAGAAA.1 -0.02215599 0.0008773935 -0.003697461 0.014406621  0.003685269
##                        [,31]       [,32]        [,33]        [,34]        [,35]
## AAACATACATTTCC.1 -0.01006521 0.002491118 -0.001875366 -0.008862147  0.002032289
## AAACATACCAGAAA.1  0.01067559 0.021875400  0.011806193  0.003560579 -0.002324063
##                         [,36]        [,37]         [,38]        [,39]
## AAACATACATTTCC.1 -0.008468607 -0.007954116 -0.0009951656  0.002792573
## AAACATACCAGAAA.1  0.002952769 -0.006034000 -0.0009532853 -0.002280508
##                        [,40]        [,41]         [,42]         [,43]
## AAACATACATTTCC.1 0.002435519 -0.015530207  0.0009633955  0.0026935385
## AAACATACCAGAAA.1 0.010147310  0.003641573 -0.0034695643 -0.0008694543
##                         [,44]        [,45]        [,46]        [,47]
## AAACATACATTTCC.1 0.0002299275 -0.003499859 -0.001978661 -0.009881723
## AAACATACCAGAAA.1 0.0037271426 -0.002181876 -0.001995334 -0.005362670
##                         [,48]        [,49]         [,50]
## AAACATACATTTCC.1 -0.004609779  0.001163945 -0.0095770715
## AAACATACCAGAAA.1  0.001793127 -0.005419268  0.0003581869
assay(mnn_res, "reconstructed")[1:2, 1:5]
## <2 x 5> LowRankMatrix object of type "double":
##       AAACATACATTTCC.1 AAACATACCAGAAA.1 AAACATACCTCGCT.1 AAACATACCTGGTA.1
## HES4      -0.001084748     -0.001208982     -0.001217450      0.001368426
## ISG15     -0.112621047     -0.128283733     -0.111386938     -0.138798378
##       AAACATACGATGAA.1
## HES4      -0.001118786
## ISG15     -0.129904496

UMAP of data merged with MNN

Make UMAP using the scater package.

library(scater)
set.seed(1001)
mnn_res <- runUMAP(mnn_res, dimred = "corrected")
mnn_res$batch <- factor(mnn_res$batch)
plotUMAP(mnn_res, colour_by = "batch")

Clustering data merged with MNN

We will cluster using the SNN graph approach from scran, as this is compatible with our SingleCellExperiment object.

snn.gr <- buildSNNGraph(mnn_res, use.dimred = "corrected")
cluster_mnn <- igraph::cluster_louvain(snn.gr)$membership
table(cluster_mnn)
## cluster_mnn
##    1    2    3    4    5    6    7    8    9   10   11   12   13 
##  806  839 1534  418 1072  835  933  848  924 1931 1350 2453   56

Clustering data merged with MNN

Lets check the UMAP for our clustered results.

mnn_res$cluster <- factor(cluster_mnn)
plotUMAP(mnn_res, colour_by = "cluster")

Clustering data merged with MNN

We can also check how many cells from each group belong to each cluster. With this approach you can see that there are more group-specific clusters.

tbl <- table(mnn_res$batch, mnn_res$cluster)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")

Evaluate with cell types

We must annotate our SingleCellExperiment object with different cell type information. We can then visualize the cell types on our UMAP.

cellType <- lapply(sce_list, function(x) {
    res <- setNames(as.character(colData(x)$seurat_annotations), colnames(x))
    return(res)
})
cell_type <- c(cellType$CTRL, cellType$STIM)
mnn_res$cell_type <- cell_type[match(rownames(colData(mnn_res)), names(cell_type))]
mnn_res$cell_type <- factor(mnn_res$cell_type)

Evaluate with cell types

plotUMAP(mnn_res, colour_by = "cell_type")

Evaluate with cell types

We can also use a heatmap to look at the specificity of each cell type to each cluster. Most are cluster-specific.

tbl <- table(mnn_res$cluster, mnn_res$cell_type)
pheatmap(tbl, scale = "column")

MNN vs RPCA

Performance is dependent on experimental context.

Generally:

  • RPCA - better suited to more homogeneous datasets
  • MNN - better suited to more heterogeneous datasets

Merge data with Harmony


Harmony

  • Harmony is an R package for single-cell data integration link.
  • The user manual of Harmony is also available link.
  • There is also a python implementation of this package.
  • Here, we will demonstrate how to integrate Harmony into the regular Seurat workflow.

Prepare data for Harmony

We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.

seu_obj <- merge(ifnb_list$CTRL, ifnb_list$STIM)
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)
seu_obj <- RunPCA(seu_obj)
seu_obj <- RunUMAP(seu_obj, reduction = "pca", dims = 1:10, reduction.name = "umap")

Prepare data for Harmony

As you can see, we are back to our completely separate groups.

DimPlot(seu_obj)

Merge data with Harmony

We can use the RunHarmony() function to implement the Harmony correction.

library(harmony)
seu_obj <- RunHarmony(seu_obj, group.by.vars = "stim", assay.use = "RNA")
seu_obj <- RunUMAP(seu_obj, reduction = "harmony", dims = 1:10, reduction.name = "umap_harmony")
DimPlot(seu_obj, reduction = "umap_harmony")
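To evaluate the Harmony merge the same way we evaluated RPCA and MNN, we can cluster on the harmony reduction and tabulate clusters against groups. This is a sketch; the dims choice simply mirrors the UMAP call above:

```r
# Cluster on the Harmony-corrected embedding rather than the raw PCA
seu_obj <- FindNeighbors(seu_obj, reduction = "harmony", dims = 1:10)
seu_obj <- FindClusters(seu_obj)

tbl <- table(seu_obj$stim, seu_obj$seurat_clusters)
barplot(tbl, beside = TRUE, main = "Cell numbers in each cluster of each group")
```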

An advanced scRNAseq workflow

overview

Cell type annotation


Cell type annotation

To annotate single-cell data sets, we can evaluate the expression pattern of well-known cell-type-specific marker genes and make a manual annotation, like we did in section II. Here, we will introduce two more strategies to make cell type annotations automatically:

  1. Mapping and annotating query datasets with Seurat using a reference data set. link
  2. Annotation with SingleR. link

Prepare example datasets

The demo data, panc8, was fetched by using SeuratData(). It contains single-cell RNA-Seq of human pancreatic islet sequenced with different techniques.

The demo data contains:

  • CELSeq dataset1 (GSE81076), used as reference
  • CELSeq dataset2 (GSE85241), used as reference
  • SMART-Seq2 (E-MTAB-5061), used as query
  • Fluidigm C1 (GSE86469)

As we did before we will split this dataset. We have also prepped this dataset for you, if you want to load the data object in directly.

library(Seurat)
library(SeuratData)
InstallData("panc8")
data("panc8")

Prepare example datasets

Split up the datasets by techniques using SplitObject().

panc8 <- UpdateSeuratObject(panc8)
seu_list <- SplitObject(panc8, split.by = "tech")
names(seu_list) <- c("celseq1", "celseq2", "smartseq", "fluidigmc1", "indrop")
load("data/panc8.RData")

Prepare reference dataset

For this we need a reference dataset and a query dataset. Merge celseq1 and celseq2 data with RPCA to make the reference.

seu_list <- lapply(seu_list, data_proc)
ref_list <- seu_list[c("celseq1", "celseq2")]
feats <- SelectIntegrationFeatures(ref_list)
ref_list <- lapply(ref_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Prepare reference dataset

Merge celseq1 and celseq2 data with RPCA

anchors <- FindIntegrationAnchors(ref_list, anchor.features = feats, reduction = "rpca")
ref_panc <- IntegrateData(anchorset = anchors)
ref_panc <- ScaleData(ref_panc)
ref_panc <- quick_clust(ref_panc)

Prepare reference dataset

How does the data look?

DimPlot(ref_panc, group.by = "seurat_clusters", split.by = "tech", pt.size = 0.2,
    label = TRUE)

Reference and query datasets

panc_list <- list(ref = ref_panc, query = seu_list$smartseq)
feats <- SelectIntegrationFeatures(panc_list)
panc_list <- lapply(panc_list, function(seu, feats) {
    seu <- ScaleData(seu, features = feats, verbose = FALSE)
    seu <- RunPCA(seu, features = feats, verbose = FALSE)
    return(seu)
}, feats)

Predict cell types for query

Identify the anchors between the reference and query data sets using FindTransferAnchors(). These are essential to transfer information from our reference to our query. We can then transfer the cell type information: for each cell in the query dataset, a score for each cell type is estimated from the expression pattern of the anchor genes using the TransferData() function.

anchors <- FindTransferAnchors(reference = panc_list$ref, query = panc_list$query,
    dims = 1:30, reference.reduction = "pca")
pred_res <- TransferData(anchorset = anchors, refdata = panc_list$ref$celltype)
head(pred_res, 2)
##       predicted.id prediction.score.gamma prediction.score.acinar
## AZ_A2        gamma              0.9913817                       0
## AZ_B9        alpha              0.0000000                       0
##       prediction.score.alpha prediction.score.delta prediction.score.beta
## AZ_A2            0.008618342                      0                     0
## AZ_B9            1.000000000                      0                     0
##       prediction.score.ductal prediction.score.endothelial
## AZ_A2                       0                            0
## AZ_B9                       0                            0
##       prediction.score.activated_stellate prediction.score.schwann
## AZ_A2                                   0                        0
## AZ_B9                                   0                        0
##       prediction.score.mast prediction.score.macrophage
## AZ_A2                     0                           0
## AZ_B9                     0                           0
##       prediction.score.epsilon prediction.score.quiescent_stellate
## AZ_A2                        0                                   0
## AZ_B9                        0                                   0
##       prediction.score.max
## AZ_A2            0.9913817
## AZ_B9            1.0000000

Predict cell types for query

The cell type with the highest score is assigned to the given cell. We can visualize these scores with a heatmap.

mat <- as.matrix(pred_res[, -c(1, 15)])
colnames(mat) <- gsub("prediction.score.", "", colnames(mat))
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)

Load cell types into query

pred_cellType <- setNames(pred_res$predicted.id, rownames(pred_res))
panc_list$query[["cellType_predBySeurat"]] <- pred_cellType[match(Cells(panc_list$query),
    names(pred_cellType))]

head(panc_list$query, 2)
##       orig.ident nCount_RNA nFeature_RNA      tech replicate assigned_cluster
## AZ_A2         AZ     384801         3611 smartseq2 smartseq2             <NA>
## AZ_B9         AZ     654549         4433 smartseq2 smartseq2             <NA>
##       celltype   dataset cellType_predBySeurat
## AZ_A2    gamma smartseq2                 gamma
## AZ_B9    alpha smartseq2                 alpha
table(panc_list$query$cellType_predBySeurat)
## 
##             acinar activated_stellate              alpha               beta 
##                192                 55               1028                316 
##              delta             ductal        endothelial              gamma 
##                122                439                 21                200 
##         macrophage               mast quiescent_stellate            schwann 
##                  7                  7                  5                  2

Annotation with SingleR

SingleR is a Bioconductor package which can be used to predict annotations for a single-cell dataset.

This package uses SingleCellExperiment objects, so we need to convert our Seurat object.

sce_list <- lapply(panc_list, function(seu) {
    sce <- as.SingleCellExperiment(seu, assay = "RNA")
    return(sce)
})

Annotation with SingleR

The score is generated by comparing the expression levels of a cell in the query dataset with the expression pattern of each group (e.g. cell type) in the reference dataset. A cell is assigned the cell type with the highest score.

library(SingleR)
pred_res <- SingleR(ref = sce_list$ref, test = sce_list$query, labels = sce_list$ref$celltype)
head(pred_res, 2)
## DataFrame with 2 rows and 4 columns
##                               scores      labels delta.next pruned.labels
##                             <matrix> <character>  <numeric>   <character>
## AZ_A2 0.300904:0.178568:0.436055:...       gamma  0.1394303            NA
## AZ_B9 0.318227:0.227997:0.529730:...       alpha  0.0861217         alpha

Annotation with SingleR

By converting to a matrix, we can check the cell type scoring using a heatmap.

mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)

Import SingleR annotation

Lastly we want to add our annotation back to our query dataset.

cell_type <- setNames(pred_res$pruned.labels, rownames(pred_res))
panc_list$query$cellType_predBySingleR <- cell_type[match(Cells(panc_list$query),
    names(cell_type))]
head(panc_list$query, 2)
##       orig.ident nCount_RNA nFeature_RNA      tech replicate assigned_cluster
## AZ_A2         AZ     384801         3611 smartseq2 smartseq2             <NA>
## AZ_B9         AZ     654549         4433 smartseq2 smartseq2             <NA>
##       celltype   dataset cellType_predBySeurat cellType_predBySingleR
## AZ_A2    gamma smartseq2                 gamma                   <NA>
## AZ_B9    alpha smartseq2                 alpha                  alpha

Seurat vs SingleR annotation

First we can compare each prediction back to the original annotation and count how many agree.

Seurat annotation:

table(panc_list$query$cellType_predBySeurat == panc_list$query$celltype)
## 
## FALSE  TRUE 
##    41  2353

SingleR annotation:

table(panc_list$query$cellType_predBySingleR == panc_list$query$celltype)
## 
## FALSE  TRUE 
##    35  2299

Seurat vs SingleR annotation

Next we can compare our two annotations:

table(panc_list$query$cellType_predBySeurat == panc_list$query$cellType_predBySingleR)
## 
## FALSE  TRUE 
##    37  2297
tbl <- table(panc_list$query$cellType_predBySeurat, panc_list$query$cellType_predBySingleR)
pheatmap::pheatmap(tbl, scale = "row")

Seurat vs SingleR annotation

The two methods agree closely here, but this isn’t always the case; this is good demonstration data.

  • Seurat: Wins on speed
  • SingleR: More reliable and consistent

Trajectory and Pseudotime analysis


Trajectory and pseudotime analysis

  • Many biological processes are linked to dynamic changes in cellular state, e.g. immune cell activation.
  • Trajectory analysis is a way to infer these kinds of dynamic changes from single-cell data.
  • According to the postulated trajectories, we can also estimate pseudotime of each cell included in a particular trajectory.
  • Formal definition of pseudotime: Latent (unobserved) dimension which measures the cells’ progress through the transition.
  • Related to but not necessarily the same as laboratory capture time.

How does it work?

  • PCA is used to determine which variables (cells) have maximum variance - and which genes drive these differences
  • Trajectory inference / pseudotime algorithms apply downstream statistical methods to linearly transform the cells within each PC and categorize cells into lineages with different end cellular states
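The PCA step above can be illustrated on a toy example. This is a base-R sketch on simulated data (not the HSC dataset used below); it shows how the proportion of variance captured by each component is obtained:

```r
set.seed(1)
# Simulated expression matrix: 100 cells x 50 genes of random noise
mat <- matrix(rnorm(100 * 50), nrow = 100)
pca <- prcomp(mat, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- summary(pca)$importance["Proportion of Variance", ]
head(var_explained)
```

On real data the leading PCs capture much more variance than the rest; on pure noise, as here, the proportions are nearly flat.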


Prepare the demo

  • Here, we will identify trajectories reflecting hematopoietic stem cell differentiation.
  • In this practice, we are using the SMART-Seq data of mouse hematopoietic stem cells (HSCs) published by Nestorowa et al., Blood (2016) link.
  • We will provide the SingleCellExperiment.
  • The SingleCellExperiment (sce) object was downloaded and processed by following this vignette link.
sce <- readRDS("data/sceOBJ_Nestorowa_HSC.rds")
sce
## class: SingleCellExperiment 
## dim: 46078 1656 
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(5): cell.type FACS sizeFactor label cellType
## reducedDimNames(3): diffusion PCA UMAP
## mainExpName: endogenous
## altExpNames(1): ERCC

Estimate trajectories

To find the trajectories we need to fit our high dimensional data onto a curve. This step is processed with the slingshot link package.

To run slingshot we need:

  • Our sce object: sce
  • Clusters: sce$label
  • Our choice of dimension reduction: PCA
  • approx_points = 100, which reduces computational load by approximating each curve with 100 points
  • omega = TRUE, which avoids connecting unrelated trajectories

library(slingshot)
sce.sling <- slingshot(sce, cluster = sce$label, reducedDim = "PCA", approx_points = 100,
    omega = TRUE)

Estimate trajectories

colData(sce.sling)[1, 1:5]
## DataFrame with 1 row and 5 columns
##                     cell.type                        FACS sizeFactor    label
##                   <lgCMatrix>                    <matrix>  <numeric> <factor>
## HSPC_025 FALSE:FALSE:TRUE:... 27675.2:1.17991:1.12999:...   0.499986        3
##              cellType
##           <character>
## HSPC_025 Erythrocytes

Extract information for each trajectory

Slingshot has fitted principal curves. We will now embed these curves in UMAP space with embedCurves(). We can see the number of lineages detected at this point.

embedded <- embedCurves(sce.sling, "UMAP")

embedded@metadata$lineages
## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
## 
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
## 
## $Lineage3
## [1] "7"

Estimate pseudotime

Once we have lineages, we can then find the position of each cell along these trajectories. This is estimating the pseudotime.

pseudo.paths <- slingPseudotime(sce.sling)
head(pseudo.paths, 2)
##           Lineage1 Lineage2 Lineage3
## HSPC_025 111.83001       NA       NA
## HSPC_031  96.15946 99.77945       NA

It will be useful to combine these times, to find the shared time.

avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling)$avg_pseudoTime <- avg_pseudoTime

Estimate pseudotime

Once we have the average pseudotime we can always project these values on to our UMAP.

plotUMAP(sce.sling, colour_by = "avg_pseudoTime")

Plotting principal curves

The slingCurves() function fetches the essential information for each principal curve. The results are in a list, with one element per curve. For each principal curve, we can extract the UMAP coordinates and the order of cells for plotting.

embedded_curve <- slingCurves(embedded)

curve <- lapply(embedded_curve, function(x) {
    dat <- data.frame(x$s[x$ord, ])  # UMAP axis and the order of cells
    return(dat)
})
names(curve) <- c("curve1", "curve2", "curve3")

head(curve$curve1)
##       UMAP1     UMAP2
## 1 -12.23250 0.6437522
## 2 -12.01319 0.7200359
## 3 -11.79387 0.7962985
## 4 -11.57454 0.8725365
## 5 -11.35520 0.9487377
## 6 -11.13584 1.0248795

Plotting principal curves

We can now add our curve information directly to our pseudotime UMAP plot using the ggplot2 geom_path() layer.

plotUMAP(sce.sling, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
    y = UMAP2), color = "black") + geom_path(data = curve$curve3, aes(x = UMAP1,
    y = UMAP2), color = "red")

Refine the estimations

The results indicate that lineage 3 is an outlier. If we look back at our lineage information, this lineage contains only cluster 7.

embedded@metadata$lineages
## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
## 
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
## 
## $Lineage3
## [1] "7"

Refine the estimations

We can remove cluster 7 and re-evaluate the trajectories. First we rerun the pseudotime estimation.

sce2 <- sce[, colData(sce)$label != 7]
sce.sling2 <- slingshot(sce2, cluster = sce2$label, reducedDim = "PCA", approx_points = 100,
    omega = TRUE)
pseudo.paths <- slingPseudotime(sce.sling2)
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling2)$avg_pseudoTime <- avg_pseudoTime

Refine the estimations

Next we extract the principal curves again.

embedded <- embedCurves(sce.sling2, "UMAP")
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
    dat <- data.frame(x$s[x$ord, ])  # UMAP axis and the order of cells
    return(dat)
})
names(curve) <- c("curve1", "curve2")

Refine the estimations

plotUMAP(sce.sling2, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
    y = UMAP2), color = "black")

Looking at a single lineage

So far we have visualized the lineages together. We will want to break this down and compare to clusters. First we will look at clusters.

plotUMAP(sce.sling2, colour_by = "label")

Looking at a single lineage

We can see which clusters make up lineage 1 and overlay its trajectory.

embedded@metadata$lineages$Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_1") + geom_path(data = curve$curve1,
    aes(x = UMAP1, y = UMAP2))

Looking at a single lineage

We can see which clusters make up lineage 2 and overlay its trajectory.

embedded@metadata$lineages$Lineage2
## [1] "8" "5" "9" "1" "2" "6"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_2") + geom_path(data = curve$curve2,
    aes(x = UMAP1, y = UMAP2))

Identify driver genes for each lineage

Driver genes are genes whose expression changes track changes in pseudotime. We can find these using testPseudotime() from TSCAN, which fits gene expression against pseudotime with a non-linear model and then tests the fit with an ANOVA.

We will start by just looking at lineage 1.

library(TSCAN)
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_1)

Identify driver genes for each lineage

We can see the results. Often we will want to reorder them to prioritize the biggest changers.

res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
head(res)
## DataFrame with 6 rows and 4 columns
##                        logFC      p.value          FDR             SYMBOL
##                    <numeric>    <numeric>    <numeric>        <character>
## ENSMUSG00000021728 0.0678996  0.00000e+00  0.00000e+00 ENSMUSG00000021728
## ENSMUSG00000016494 0.0654681  0.00000e+00  0.00000e+00 ENSMUSG00000016494
## ENSMUSG00000040747 0.0639249 7.23447e-296 5.16939e-293 ENSMUSG00000040747
## ENSMUSG00000057729 0.0619007  0.00000e+00  0.00000e+00 ENSMUSG00000057729
## ENSMUSG00000021998 0.0610830  0.00000e+00  0.00000e+00 ENSMUSG00000021998
## ENSMUSG00000090164 0.0608169 4.81695e-237 1.76510e-234 ENSMUSG00000090164

Identify driver genes for each lineage

We can then categorize our data based on significance and FC.

  • Recommended FDR threshold: 0.05
  • The logFC reflects the trend of gene expression along pseudotime.
    • logFC > 0: expression increases with increasing pseudotime [related to late stage]
    • logFC < 0: expression decreases with increasing pseudotime [related to early stage]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
res[1:2, ]
## DataFrame with 2 rows and 5 columns
##                        logFC   p.value       FDR             SYMBOL        stat
##                    <numeric> <numeric> <numeric>        <character> <character>
## ENSMUSG00000021728 0.0678996         0         0 ENSMUSG00000021728        late
## ENSMUSG00000016494 0.0654681         0         0 ENSMUSG00000016494        late

Identify driver genes for each lineage

We can quickly and easily see how many are upregulated/downregulated across the trajectory.

table(res$stat)
## 
## early  late    nc 
## 11261  8848 25969

Top driver genes

Using this ordering we can easily isolate the top driver genes. Here we can see the top 5 early genes.

sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
top
## [1] "ENSMUSG00000054191" "ENSMUSG00000004655" "ENSMUSG00000027556"
## [4] "ENSMUSG00000031877" "ENSMUSG00000031162"

Plotting top driver genes

With plotExpression() we can look at the expression of specific genes, i.e. our top genes. We can see how each gene is expressed across all cells and pseudotime.

plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
    colour_by = "label")

Plotting top driver genes

We can now repeat this for the top 5 late genes.

sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
    colour_by = "label")


Identify driver genes

Now let's repeat this whole process for lineage 2.

res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_2)
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"

table(res$stat)
## 
## early  late    nc 
## 11566  8450 26062

Plotting top driver genes

sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_2",
    colour_by = "label")

Plotting top driver genes

sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_2",
    colour_by = "label")

Choosing a trajectory inference and pseudotiming algorithm

full image

CITE-Seq


CITE-Seq

The CITE-Seq method labels different types of cells or different samples with hashtag-conjugated antibodies. The cells can then be pooled into a single library and sequenced at the same time. After sequencing, the cells can be separated by their different antibody hashtags.

  • Here, we will use a dataset from Seurat vignette link

Load data

rna.mat <- readRDS("data/pbmc_umi_mtx.rds")
dim(rna.mat)
## [1] 27117 16916
hto.mat <- readRDS("data/pbmc_hto_mtx.rds")
dim(hto.mat)
## [1]     8 16916
rownames(hto.mat)
## [1] "HTO_A" "HTO_B" "HTO_C" "HTO_D" "HTO_E" "HTO_F" "HTO_G" "HTO_H"

Prepare data

seu_obj <- CreateSeuratObject(counts = rna.mat, project = "citeSeq_demo")
seu_obj[["HTO"]] <- CreateAssayObject(counts = hto.mat)
seu_obj
## An object of class Seurat 
## 27125 features across 16916 samples within 2 assays 
## Active assay: RNA (27117 features, 0 variable features)
##  1 layer present: counts
##  1 other assay present: HTO

Cluster with regular workflow

DefaultAssay(seu_obj) <- "RNA"
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)

Cluster with regular workflow

seu_obj <- quick_clust(seu_obj)
DimPlot(seu_obj, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()

Centered log-ratio transformation

  • Hashtag counts are a kind of compositional data, which define the proportion of hashtags in each cell barcode.
  • CLR is a method to process compositional data. It is a log-ratio transformation that centers the data around the geometric mean of each cell barcode.
DefaultAssay(seu_obj) <- "HTO"
seu_obj <- NormalizeData(seu_obj, assay = "HTO", normalization.method = "CLR")
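
To make the CLR transformation concrete, here is a toy base-R sketch of the underlying computation: each count is divided by the geometric mean of the counts it is normalized against, then log-transformed. The counts in `x` are made up, and Seurat's exact implementation differs slightly (e.g. in how zeros are handled and the margin it normalizes across).

```r
# Toy centered log-ratio (CLR) transform on made-up hashtag counts for one cell
x <- c(5, 20, 1, 400, 3)
geo_mean <- exp(mean(log1p(x)))   # geometric mean on the log1p scale
clr <- log1p(x / geo_mean)
round(clr, 2)
# The dominant hashtag stands well above the centre; background counts sit near zero
```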

Differentiate Hashtags

  • Demultiplex HTOs with HTODemux()
  • Threshold for positive call: 0.99 quantile
seu_obj <- HTODemux(seu_obj, assay = "HTO", positive.quantile = 0.99)

head(seu_obj, 2)
##                    orig.ident nCount_RNA nFeature_RNA nCount_HTO nFeature_HTO
## AGGCCACAGCGTCTAT citeSeq_demo        273          210       4200            7
## ATTGGTGAGTTCGCAT citeSeq_demo        305          174       3475            8
##                  RNA_snn_res.0.5 seurat_clusters HTO_maxID HTO_secondID
## AGGCCACAGCGTCTAT               0               0     HTO-H        HTO-E
## ATTGGTGAGTTCGCAT               2               2     HTO-H        HTO-G
##                  HTO_margin HTO_classification HTO_classification.global
## AGGCCACAGCGTCTAT 4.71133972              HTO-H                   Singlet
## ATTGGTGAGTTCGCAT 0.03995001        HTO-G_HTO-H                   Doublet
##                  hash.ID
## AGGCCACAGCGTCTAT   HTO-H
## ATTGGTGAGTTCGCAT Doublet

Decide positive hashtags ~ using HTO-A as an example

  • Use the “negative” distribution to calculate the cut-off value (99% quantile).
  • Every cell barcode with HTO-A >= the cut-off is assigned as HTO-A positive.
# Distribution of HTO-A level
RidgePlot(seu_obj, features = "HTO-A", group.by = "orig.ident") + NoLegend()

Demultiplex result

  • The demultiplex results are in the column HTO_classification.global and hash.ID
  • In the HTO_classification.global:
    • The Singlet means the cell barcode with only one particular hashtag. It can be applied to further analysis.
    • The Negative means no hashtags passed cut-off in the cell barcodes.
    • The Doublet means multiple hashtags passed the cut-off in the cell barcodes.
      • A Doublet here does not necessarily mean multiple cells in a single droplet.
  • In the column hash.ID, the particular hashtag assigned to each cell barcode is given. The Negative and Doublet labels are the same as in the column HTO_classification.global.
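
For downstream analysis we typically keep only the singlets. A minimal sketch of that filtering step, assuming the seu_obj object from above:

```r
# Keep only cell barcodes confidently assigned to a single hashtag
seu_singlet <- subset(seu_obj, subset = HTO_classification.global == "Singlet")
table(seu_singlet$hash.ID)
```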

Demultiplex result

RidgePlot(seu_obj, features = c("HTO-A", "HTO-B"), group.by = "hash.ID") + NoLegend()

#
table(seu_obj$HTO_classification.global)
## 
##  Doublet Negative  Singlet 
##     2598      346    13972
#
table(seu_obj$hash.ID)
## 
##  Doublet    HTO-H    HTO-D    HTO-E    HTO-G    HTO-F    HTO-B    HTO-C 
##     2598     1808     1716     1487     1660     1520     1993     1873 
##    HTO-A Negative 
##     1915      346
#
table(seu_obj$HTO_classification.global, seu_obj$hash.ID)
##           
##            Doublet HTO-H HTO-D HTO-E HTO-G HTO-F HTO-B HTO-C HTO-A Negative
##   Doublet     2598     0     0     0     0     0     0     0     0        0
##   Negative       0     0     0     0     0     0     0     0     0      346
##   Singlet        0  1808  1716  1487  1660  1520  1993  1873  1915        0

UMAP ~ split by Hashtags

DimPlot(seu_obj, group.by = "seurat_clusters", label = TRUE, pt.size = 0.2, split.by = "hash.ID",
    ncol = 5) + NoLegend()

For cells not assigned a particular hashtag

  • The cell barcodes labeled Negative or Doublet would not be used directly for further analysis.
    • Generally, the mis-labeled rate in CITE-Seq ranges from 10% ~ 80% (avg 30%).
    • If you use hashtags to label cell types, you could try to rescue the mis-labeled cells using the clustering in the UMAP [assuming a given cell type groups together].
    • If you use hashtags to label samples, it is safer to exclude these cells from the following analysis.
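
The cluster-based rescue idea can be sketched as a majority vote: assign each Negative cell the most common hash.ID among the singlets in its cluster. This is a hypothetical sketch, assuming the seu_obj object from above, and is only sensible when hashtags label cell types that should co-cluster:

```r
# Hypothetical rescue of Negative cells by cluster majority vote
meta <- seu_obj[[]]
singlets <- meta[meta$HTO_classification.global == "Singlet", ]
# Most common hashtag among singlets in each cluster
majority <- sapply(split(as.character(singlets$hash.ID), singlets$seurat_clusters),
                   function(ids) names(which.max(table(ids))))
rescued <- as.character(meta$hash.ID)
neg <- meta$HTO_classification.global == "Negative"
rescued[neg] <- majority[as.character(meta$seurat_clusters[neg])]
seu_obj$hash.ID_rescued <- rescued
```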

SMART-Seq


SMART-seq - a 10X alternative

Schema

SMART-seq - a 10X alternative

  • Pros
    • Full-length coverage of transcripts
    • Increased depth
    • Differential transcript usage
  • Cons
    • Much more expensive
    • More complicated bench work
  • Applications
    • Generic clustering
    • Differential gene expression
    • Alternative splicing

Demo data

  • Single-cell quantification of a broad RNA spectrum reveals unique noncoding patterns associated with cell types and states. Isakova A, Neff N, and Quake SR, PNAS (2021). link

  • The count matrix of the mouse embryo data was downloaded from GEO GSE151334.

  • The data processing workflow is recorded on GitHub. link

Alignment and counting

  • SMART-Seq data pre-processing is very similar to bulk RNA-seq, consisting of alignment to a reference genome and counting. There are two strategies for processing from FASTQ files.

    • Align to the reference genome with STAR and then count with featureCounts [as for this demo data].
    • Pseudo-align and count with salmon and then gather the count matrix with tximport.
  • Details of the two methods can be found in the BRC training course for bulk RNAseq. Please refer to this link.

  • When considering whether to do pseudo-alignment or alignment, we do have to weigh the pros and cons of these approaches. Pseudo-alignment is much faster, but alignment gives you the exact location of each read. This can be important if you suspect RNA degradation or similar QC issues, as you can check the 5’-3’ bias of the reads. This is not possible with 10X.
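
If you take the salmon route, the per-cell quantifications are gathered into a gene-level count matrix with tximport. A minimal sketch, where the salmon_out directory layout and the tx2gene table are placeholders you would build for your own data:

```r
library(tximport)
# One salmon quant.sf per cell (placeholder directory layout)
files <- list.files("salmon_out", pattern = "quant.sf",
                    recursive = TRUE, full.names = TRUE)
names(files) <- basename(dirname(files))
# tx2gene: a data.frame mapping transcript IDs to gene IDs (built from your GTF)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
counts <- txi$counts  # gene-by-cell matrix, ready for CreateSeuratObject()
```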

Read the count matrix (tsv file) into a matrix

mtx <- read.delim("data/GSE151334_counts.mouse.tsv.gz")
mtx[1:2, 1:5]
##               mESC_EB_d0_A11_S11 mESC_EB_d0_A12_S12 mESC_EB_d0_A13_S13
## 0610005C13Rik                  0                  0                  0
## 0610006L08Rik                  0                  0                  0
##               mESC_EB_d0_A14_S14 mESC_EB_d0_A15_S15
## 0610005C13Rik                  0                  0
## 0610006L08Rik                  0                  0

Load matrix into Seurat object

Once we have loaded the matrix, we can create a Seurat object. After this point we will be following a similar workflow to the 10X data. Here we start by finding the mitochondrial content.

library(Seurat)
seu <- CreateSeuratObject(mtx, project = "GSE151334", min.cells = 3, min.features = 300)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^mt-")
seu <- SCTransform(seu, variable.features.n = 2000)

Assess doublets

library(reticulate)
reticulate::py_install("scrublet")
scr <- reticulate::import("scrublet")

mat <- GetAssayData(seu, assay = "RNA", slot = "counts")
mat <- as.matrix(mat)
scrub <- scr$Scrublet(t(mat))
doublet <- scrub$scrub_doublets()
names(doublet) <- c("doublet_score", "doublet")
seu[["doublet_score"]] <- doublet$doublet_score
seu[["doublet"]] <- doublet$doublet

Estimate cell cycle phases

Let's read in the annotated cell cycle genes from the Regev lab. This list was curated by the authors of Seurat, and is a subset of the genes used in the original cell cycle paper.
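
A minimal sketch of the scoring step, using the cell cycle gene lists bundled with Seurat (cc.genes.updated.2019, human symbols; for this mouse dataset we naively convert them to title case, which recovers most but not all orthologs):

```r
library(Seurat)
# Convert human gene symbols (e.g. MCM5) to mouse-style symbols (Mcm5)
to_mouse <- function(x) paste0(substr(x, 1, 1), tolower(substring(x, 2)))
s.genes <- to_mouse(cc.genes.updated.2019$s.genes)
g2m.genes <- to_mouse(cc.genes.updated.2019$g2m.genes)
# Adds S.Score, G2M.Score and Phase columns to the metadata
seu <- CellCycleScoring(seu, s.features = s.genes, g2m.features = g2m.genes)
```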

Data normalization and quick clustering

set.seed(42)
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
seu <- FindNeighbors(seu, dims = 1:15, reduction = "pca")
seu <- RunUMAP(seu, dims = 1:15, reduction = "pca")
seu <- FindClusters(seu, resolution = 0.5)

Quality assessment

We can see here that this is a highly variable dataset. We can also see that there are some cells with high mitochondrial content in some samples. We will filter these out later.

library(ggplot2)
VlnPlot(seu, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "doublet_score",
    "S.Score", "G2M.Score"), group.by = "seurat_clusters")

Quality assessment

FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "seurat_clusters")

Quality assessment

Another view of our high mitochondrial content.

FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by = "seurat_clusters") +
    geom_hline(yintercept = 10, color = "black", linetype = "dashed")

Quality assessment

table(seu$doublet == "TRUE" | seu$percent.mt > 10)
## 
## FALSE  TRUE 
##  1559   373

Filtering low quality cells

set.seed(42)
seu <- subset(seu, subset = percent.mt <= 10 & doublet == "FALSE")
seu <- SCTransform(seu, variable.features.n = 2000, vars.to.regress = c("percent.mt"))
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
seu <- RunUMAP(seu, dims = 1:15, reduction = "pca")
seu <- FindNeighbors(seu, dims = 1:15, reduction = "pca")
seu <- FindClusters(seu, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 1559
## Number of edges: 42165
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8941
## Number of communities: 10
## Elapsed time: 0 seconds

Filtering low quality cells

DimPlot(seu, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()

Marker genes by clusters

markers <- FindAllMarkers(seu, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
library(dplyr)
top_genes <- markers %>%
    group_by(cluster) %>%
    slice_max(n = 5, order_by = avg_log2FC)

Marker genes by clusters

DoHeatmap(seu, features = c(top_genes$gene))

Add annotation by using known marker genes

VlnPlot(seu, features = c("Nanog", "Pou5f1"), ncol = 2, pt.size = 0)

Add annotation by using known marker genes

VlnPlot(seu, features = c("Pax6", "Stau2"), ncol = 2, pt.size = 0)

Add annotation by using known marker genes

VlnPlot(seu, features = c("Epcam", "Gata6"), ncol = 2, pt.size = 0)

Add annotation by using known marker genes

VlnPlot(seu, features = c("Dlk1", "Postn"), ncol = 2, pt.size = 0)

Add annotation by using known marker genes

seu[["layers"]] <- NA
seu$layers[seu$seurat_clusters %in% c(0, 4, 5)] <- "primary_ES"
seu$layers[seu$seurat_clusters %in% c(2, 3, 10, 11, 12)] <- "ectoderm"
seu$layers[seu$seurat_clusters %in% c(1, 9)] <- "mesoderm"
seu$layers[seu$seurat_clusters %in% c(6, 7, 8)] <- "endoderm"

Add annotation by using known marker genes

DimPlot(seu, group.by = "layers", pt.size = 0.2)